class: center, middle, inverse, title-slide .title[ # Introduction to Text Analysis with R ] .subtitle[ ## ⚔
using xaringan
]
.author[
### Alisson Soares
]
.institute[
### Latin R 2021
]
.date[
### November 11, 2021
]

---

<!-- <style type="text/css"> -->
<!-- .remark-slide-content { -->
<!--     font-size: 25px; -->
<!--     padding: 1em 4em 1em 4em; -->
<!-- } -->
<!-- </style> -->

<style type="text/css">
.large { font-size: 130% }
</style>

<style type="text/css">
.small { font-size: 0% }
</style>

<!-- Simply enclose the text you want to change in your Presentation in .small[some text here] or .large[textao]. -->

# Introduction

This tutorial introduces some principles of text analysis using R.

- 1) Some examples of text analysis in practice in social research.

- 2) Some basic text analysis in graphical mode (just mouse clicks, no coding). Although easy, this method has several limitations, so we move on to the third and more extensive part.

- 3) Text analysis by writing code.

---

## Who am I?

I'm Alisson Soares, a sociologist who has recently been diving into Text Analysis and Computational Social Science.

- Twitter: [alissonmasoares](https://twitter.com/alissonmasoares)
- GitHub: [https://github.com/SoaresAlisson/](https://github.com/SoaresAlisson/)
- Email: [alissonmsoares@gmail.com](mailto:alissonmsoares@gmail.com)
- My bookdown: [Introdução à Análise Textual](https://soaresalisson.github.io/analisetextual/) (in Portuguese)

--

- Groups on Telegram:
  - [R e Rstudio Humanidades (t.me/R_humanidades)](https://t.me/R_humanidades)
  - [Análise Textual-Humanidades Digitais-Grupo (t.me/humanidadesdigitiais)](https://t.me/humanidadesdigitiais)

---

# Text Analysis or text mining

---
class: chapter-slide

# Examples of Text Analysis in research

---

## Examples of Text Analysis

- Plagiarism detection

--

- "Linguistic fingerprint": finding the authorship of texts:
  - Federalist Papers ([Where GREP Came From - Computerphile](https://www.youtube.com/watch?v=NTfOnGZUZDk))
  - Finding the creator of Bitcoin: https://www.forbes.com/sites/billybambrough/2020/05/05/john-mcafee-thinks-hes-solved-bitcoins-greatest-mystery-who-is-satoshi-nakamoto/

<!-- - non-published works from Shakespeare -->

???

In the 1960s: who wrote each of the Federalist texts?
The _Federalist Papers_ are 85 texts published anonymously between 1787 and 1788 under the pseudonym "Publius"; each could have been written by James Madison, Alexander Hamilton or John Jay.
Lee McMahon mentioned this problem to his colleagues. He wanted to search for certain words across several texts. Ken Thompson, an important figure in the history of programming, heard about it and came back the next day with the program that came to be called GREP.

---

## Examples of Text Analysis in Research

Article: Robert C. Brooks, Daniel Russo-Batterham, Khandis R. Blake. 2022. ["Incel Activity on Social Media Linked to Local Mating Ecology"](https://www.gwern.net/docs/sociology/technology/2022-brooks.pdf). Psychological Science, 1–10. https://doi.org/10.1177/09567976211036065

???

I'm not evaluating the scientific quality or accuracy of these examples, only showing some possible uses of text analysis.

From a database of 4 billion Twitter posts (2012–2018), the authors geolocated 321 million tweets to 582 commuting zones in the continental United States, of which 3,649 tweets used words peculiar to incels and 3,745 were about incels.

These tweets arise disproportionately in places where mating competition among men is likely to be high because of male-biased sex ratios, few single women, high income inequality, and small gender gaps in income.

---

## Examples of Text Analysis in Research

- "Can Exposure to Celebrities Reduce Prejudice? The Effect of Mohamed Salah on Islamophobic Behaviors and Attitudes".
  - [Preprint](https://osf.io/preprints/socarxiv/eq8ca)
  - [Article](https://www.cambridge.org/core/journals/american-political-science-review/article/can-exposure-to-celebrities-reduce-prejudice-the-effect-of-mohamed-salah-on-islamophobic-behaviors-and-attitudes/A1DA34F9F5BCE905850AC8FBAC78BE58). American Political Science Review, Volume 115, Issue 4, November 2021, pp. 1111-1128. DOI: https://doi.org/10.1017/S0003055421000423

???

- After Salah started playing for Liverpool, anti-Muslim tags posted by Liverpool fans fell by half.
- Using data from 15 million tweets, the "Salah effect": after Mohamed Salah, a visibly Muslim elite player, joined Liverpool, hate crimes in the Liverpool area fell by 16% and the club's fans halved their anti-Muslim posts relative to fans of other clubs [@SalahEffect2021].

---

### Ex.: Predicting gender from a first name

- Package [genderBR](https://cran.r-project.org/web/packages/genderBR/index.html): "A method to predict and report gender from Brazilian first names using the Brazilian Institute of Geography and Statistics' Census data"

- MEIRELES, Fernando. 2010. "[genderBR: predizendo sexo a partir de nomes próprios](https://fmeireles.com/blog/rstats/genderbr-predizer-sexo/). No R e usando dados do Censo de 2010"

--

```r
genderBR::get_gender(c("isabel", "marta", "silvia", "rodrigo", "roberto", "thiago"))
```

```
## [1] "Female" "Female" "Female" "Male"   "Male"   "Male"
```

???

The method's average accuracy was always above 95%, with no sign of bias.
Tested on candidacy data from the TSE (Brazil's Superior Electoral Court), the method classified 99.5% of the observations correctly.

---
class: chapter-slide

# Step 1. Gathering Data

---

## Step 1. Gathering Data

<!-- Collecting data for Text Analysis -->
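
Whatever the source, the gathering step usually ends with raw text loaded into R. A minimal sketch in base R, assuming a hypothetical plain-text file with one document per line (it is written to a temporary location here only so that the example is self-contained):

```r
# Hypothetical example file: two short "documents", one per line
path <- tempfile(fileext = ".txt")
writeLines(c("Text analysis with R", "Gathering data is the first step"), path)

# Read the file into a character vector, one element per line
texts <- readLines(path, encoding = "UTF-8")

# Store the texts in a data frame, one row per document
corpus_df <- data.frame(doc_id = seq_along(texts), text = texts)

# Count whitespace-separated words in each document
corpus_df$n_words <- lengths(strsplit(corpus_df$text, "\\s+"))
corpus_df
```

The packages and APIs on the next slides mostly return the same kind of object: a data frame with one row per text plus some metadata.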
---

## Collecting the data

R packages with example datasets

- [dslabs](https://cran.r-project.org/web/packages/dslabs/index.html): Data Science Labs. Many datasets, one of which is `trump_tweets`: Trump tweets from 2009 to 2017.
- [Harry Potter](https://github.com/bradleyboehmke/harrypotter) by Bradley Boehmke (not to be confused with the package of the same name on CRAN) contains the full text of the seven books of the Harry Potter series.
- [corpus](https://cran.r-project.org/web/packages/corpus) includes datasets such as The Federalist Papers.
- [wordcloud](https://cran.r-project.org/web/packages/wordcloud): SOTU, "United States State of the Union Addresses (2010 and 2011). Transcripts of the state of the union speeches. saved as a tm Corpus".
- [quanteda.corpora](https://github.com/quanteda/quanteda.corpora) has many datasets, such as:
  - The corpus `data_corpus_udhr` contains the Universal Declaration of Human Rights in over 400 languages. `data_corpus_udhr[c("eng", "deu_1996", "arb", "heb", "cmn_hans", "jpn")]`. [direct link to rda file](https://github.com/quanteda/quanteda.corpora/raw/master/data/data_corpus_udhr.rda)

---

```r
install.packages("dslabs")  # note the plural: the package is called dslabs
library(dslabs)
data(trump_tweets)
```

???

The package name is `dslabs`. Typing `install.packages("dslab")` fails with:
Warning in install.packages : package ‘dslab’ is not available for this version of R

---

## APIs: "Application Programming Interface"

- some need a token/password/authentication

--

- some do not

---

### APIs that need some kind of authentication:

- YouTube: package [tuber](https://cran.r-project.org/web/packages/tuber/)
- Twitter:
  - [rtweet](https://cran.r-project.org/web/packages/rtweet/)
  - [academictwitteR](https://github.com/cran/academictwitteR).
    Collects tweets from the v2 API endpoint for the Academic Research Product Track
- [tiktokR](https://benjaminguinaudeau.github.io/tiktokr/)
- [LinkedIn](https://www.linkedin.com/pulse/rlinkedin-more-long-life-linkedin-api-network-analysis-jerad-acosta/)
- [crowdtangle](https://help.crowdtangle.com/en/articles/4302208-crowdtangle-for-academics-and-researchers)

---

### APIs that do not need authentication:

- Google Ngrams: [ngramr](https://cran.r-project.org/web/packages/ngramr/index.html)
- Google Trends: [gtrendsR](https://cran.r-project.org/web/packages/gtrendsR/)
- Gutenberg: [gutenbergr](https://cran.r-project.org/web/packages/gutenbergr/) gets the text of books from [Project Gutenberg](https://www.gutenberg.org/).
- Reddit: [RedditExtractoR](https://cran.r-project.org/web/packages/RedditExtractoR/)
- Wikipedia: [WikipediaR](https://cran.r-project.org/web/packages/WikipediaR/), [WikipediR](https://cran.r-project.org/web/packages/WikipediR/), [wikipediatrend](https://cran.r-project.org/web/packages/wikipediatrend/index.html), [getwiki](https://cran.r-project.org/web/packages/getwiki/index.html): "retrieving text data in a tidy format that can be used for Natural Language Processing"
- [imdbapi](https://cran.r-project.org/web/packages/imdbapi/index.html): Get Movie, Television Data from the 'imdb' Database
- [speechbr](https://github.com/dcardosos/speechbr) returns speeches of Brazilian deputies.

---

Not enough APIs?
Try:

- [datasets listed in my bookdown](https://soaresalisson.github.io/analisetextual/datasets.html)

--

- The [rOpenSci site](https://ropensci.org/packages/data-access/) also has a list of R packages that work with APIs, and of scientific articles that used them

--

- [ProgrammableWeb](https://www.programmableweb.com/api-research) has more than

--

24,471 APIs

--



---

## Webscraping with R:

- [rvest](https://rvest.tidyverse.org/)
- [Rcrawler](https://github.com/salimk/Rcrawler)
- [RSelenium](https://cran.r-project.org/web/packages/RSelenium/index.html)

Wickham, Hadley. 2014b. [Rvest: Easy Web Scraping with R. RStudio](https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/).

Wickham, Hadley. 2019. [SelectorGadget](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html).

---

Ready-to-use (simple, but also nice) tools

1) [Google Trends](https://trends.google.com/trends/)
  - ([FAQ about Google Trends data](https://support.google.com/trends/answer/4365533?hl=en&ref_topic=6248052))
  - ROGERS, Simon. 2016. [What is Google Trends data — and what does it mean?](https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8)

2) Google Ngrams

--

but in R!

???

What are Google Trends and Google Ngrams?

---

# GoogleTrends: gtrendsR

```r
install.packages("gtrendsR") # Installing the package
```

```r
library(gtrendsR) # Loading the package
library(ggplot2)
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

--

A simple query

```r
GT <- gtrends(c("Donald Trump", "Joe Biden"),
              #geo = "US",
              time = "2021-10-01 2022-04-01")
typeof(GT)
```

```
## [1] "list"
```

- In the parameter `geo = "US"` you can specify the country by its two-letter code,
like "US", "AR", "BR", "CL", "MX", "UY".
- It is possible to be even more specific and restrict the query to state level, using the "country-state" format, for example "US-CA", "BR-MG".

---

```r
names(GT) # names of the list elements
```

```
## [1] "interest_over_time"  "interest_by_country" "interest_by_region" 
## [4] "interest_by_dma"     "interest_by_city"    "related_topics"     
## [7] "related_queries"
```

```r
nrow(GT$interest_over_time) # number of rows in the dataset
```

```
## [1] 366
```

```r
head(GT$interest_over_time, 4)
```

```
##         date hits      keyword   geo                  time gprop category
## 1 2021-10-01   15 Donald Trump world 2021-10-01 2022-04-01   web        0
## 2 2021-10-02   14 Donald Trump world 2021-10-01 2022-04-01   web        0
## 3 2021-10-03   15 Donald Trump world 2021-10-01 2022-04-01   web        0
## 4 2021-10-04   16 Donald Trump world 2021-10-01 2022-04-01   web        0
```

---

```r
GT$interest_over_time |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = hits, colour = keyword))
```

<!-- -->

---

An interesting example from [Gustavo Prado (2020)](https://rpubs.com/pradogps/gtrends) of how Google Trends can reveal social trends

```r
plot_trend <- function(keyword_string) {
  # Query Google Trends for Brazil over most of 2020
  data <- gtrends(c(keyword_string), time = "2020-01-01 2020-10-14", geo = "BR")
  # "<1" hits are replaced by 0.5 so the column can be converted to numeric
  time_trend <- data$interest_over_time %>%
    mutate(hits = ifelse(hits == "<1", 0.5, hits),
           date = as.Date(date),
           keyword = factor(keyword, levels = keyword_string))
  plot <- ggplot(data = time_trend, aes(x = date, y = as.numeric(hits), colour = keyword)) +
    geom_line(size = .9) +
    # dashed vertical reference line on 2020-03-10
    geom_vline(xintercept = as.numeric(as.Date("2020-03-10")), color = "dark gray", linetype = "dashed", size = 1.0) +
    xlab("Time") +
    ylab("relative interest") +
    ggthemes::theme_clean(base_size = 10, base_family = "mono") +
    theme(legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(size = 9)) +
    ggtitle("Searches in Google") +
    scale_color_brewer(palette = "Set2")
  return(plot)
}
```

---

```r
cinema_plot <- plot_trend(keyword_string = c("cinema"))
parque_plot <- plot_trend(keyword_string = c("parque"))
academia_plot <- plot_trend(keyword_string = c( "academia")) bar_plot <- plot_trend(keyword_string = c("bar")) gridExtra::grid.arrange(cinema_plot, academia_plot, parque_plot, bar_plot, ncol=2, top = "Busca virtual por entretenimento e lazer em 2020: como a covid-19 afetou a indústria.") ``` <!-- --> --- # Wikipedia - [getwiki](https://cran.r-project.org/web/packages/getwiki/index.html): "retrieving text data in a tidy format that can be used for Natural Language Processing" ### getwiki package ```r install.packages("getwiki") # installing the package ``` --- ```r library(getwiki) # loading the package search_wiki("El Chavo del Ocho") # searching ``` ``` ## titles ## 1 30 Anos de Chaves ## 2 Angelines Fernández ## 3 Carlos Villagrán ## 4 Chespirito ## 5 Chespirito (TV series) ## 6 El Chapulín Colorado ## 7 El Chavo (disambiguation) ## 8 El Chavo Animado ## 9 El Chavo Kart ## 10 El Chavo del Ocho ## 11 En el Cine (El Chavo del Ocho episode) ## 12 Florinda Meza ## 13 Horacio Gómez Bolaños ## 14 List of El Chavo del Ocho characters ## 15 María Antonieta de las Nieves ## 16 Ramón Valdés ## 17 Raúl Padilla ## 18 Rubén Aguirre ## 19 Turkish March (Beethoven) ## 20 Édgar Vivar ## content ## 1 30 Anos de Chaves (English: 30 Years of El Chavo, Spanish: 30 Años de El Chavo) is a Brazilian TV special celebrating the 30 anniversary of the Mexican TV series El Chavo del Ocho created by Roberto Gómez Bolaños. This special was aired on SBT on August 19, 2011. In opposition to the title, the special marks the 30th anniversary of the broadcaster of the show. ## 2 María de los Ángeles Fernández Abad (30 July 1924 – 25 March 1994), known professionally as Angelines Fernández, was a Spanish-Mexican actress and comedian. She is best remembered for playing Doña Clotilde "La Bruja del 71" in the sitcom El Chavo del Ocho. She was an anti-Franco refugee who remained in Mexico (in addition to a brief stint in Cuba) from 1947 until the end of her life. 
## 3 Carlos Villagrán Eslava (born 12 January 1944) is a Mexican actor, comedian, and former journalist best known for playing Quico in the Televisa sitcom El Chavo del Ocho and the Telerey sitcom ¡Ah cabellos!” ## 4 Roberto Gómez Bolaños (21 February 1929 – 28 November 2014), more commonly known by his stage name Chespirito, or "Little Shakespeare", was a Mexican actor, comedian, screenwriter, humorist, director, producer, and author. He is widely regarded as one of the icons of Spanish-speaking humor and entertainment and one of the greatest comedians of all time. He is also one of the most loved and respected comedians in Latin America. He is mostly known by his acting role Chavo from the sitcom El Chavo del 8.He is recognized all over the planet for writing, directing, and starring in the Chespirito (1970–1973, 1980–1995), El Chavo del Ocho (1973–1980), and El Chapulín Colorado (1973–1979) television series. The character of El Chavo is one of the most iconic in the history of Latin American television, and El Chavo del Ocho continues to be immensely popular, with daily worldwide viewership averaging 91 million viewers. ## 5 Chespirito is a Mexican sketch comedy show created by and starring comedian and actor Roberto Gomez Bolaños, whose nickname gave the show its title.Two series were produced with the same title. The first premiered as Los Supergenios de la Mesa Cuadrada on Televisión Independiente de México on October 15, 1970, after a two-year span of this sketch being part of the Sábados de la Fortuna/Sábados con Neftalí/Carrousel con Neftalí show (hosted by Neftalí Lopez Paez), aired in the same channel, since October 1968, during the 1968 Summer Olympics in Mexico City; the independent series adopted the Chespirito y La Mesa Cuadrada and later the Chespirito title in 1971, and aired until February 1973. 
The second series, which aired on the TIM's successor, Televisa, premiered on 4 February 1980 and aired until 25 September 1995.Alongside Bolaños, other famous Mexican actors starred in the sketches. In the two periods, characters like El Chavo del Ocho, El Chapulín Colorado, Los Caquitos, Dr. Chapatín, Los Chifladitos, El Ciudadano Gómez, La Chicharra, Chespirito (character), Los Chiripiojos and the parodies of Chaplin and Laurel and Hardy starred in sketches 2–15 minutes long. The 1980-1995 period also featured special 40-minute episodes, even consisting of two or three 40-minute long parts.The show's first seasons featured a canned laugh track, but Bolaños made the then-controversial decision to drop it, in 1982, after the departure of Ramón Valdés. From 1982 to 1985, most presentations of the show include the announcer's warning that, out of respect for the audience's intelligence, there was no laugh track. ## 6 El Chapulín Colorado (English: The Red Grasshopper) is a Mexican television comedy series that ran from 1970 to 1993 and parodied superhero shows. It was created by Roberto Gómez Bolaños (Chespirito), who also played the main character. It was first aired by Televisa in 1973 in Mexico, and then was aired across Latin America and Spain until 1981, alongside El Chavo, which shared the same cast of actors. Both shows have endured in re-runs and have won back some of their popularity in several countries such as Colombia or Peru, where it has aired in competition with The Simpsons (which features a recurring parody of the character). The name translates literally in English as "The Red Grasshopper" or "The Cherry Cricket" (the word chapulín is of Nahuatl origin and applies to a Mexican species of grasshopper, while colorado means "red".). The main character uses a conspicuous red uniform. 
It is known in Brazil as "Chapolin", "Vermelhinho" ("Little Red One") and "Polegar Vermelho" ("Red Thumb") in allusion to the famous fairy tale character Tom Thumb.Although the series has a regular cast (the same cast as El Chavo), all actors but Gómez Bolaños play different characters each episode, and it is therefore described as an anthology series. ## 7 El Chavo is a TV series aired between 1973 and 1980 originally titled of El Chavo del Ocho.El Chavo may also refer to:El Chavo AnimadoEl Chavo (character)El Chavo (video game) ## 8 El Chavo Animado (El Chavo: The Animated Series in English) is a Mexican animated series based on the live action television series El Chavo del Ocho, created by Roberto Gómez Bolaños, produced by Televisa and Ánima Estudios. It aired on Canal 5, and repeats were also shown on Las Estrellas and Cartoon Network Latin America. 134 episodes aired between 2006 and 2014 and starts a hiatus.After several years of successful repeats of the original series, on 21 October 2006 Televisa launched in Mexico and the rest of Latin America an animated version of the program by Ánima Estudios to capitalise on the original series' popularity. With the series, Televisa began a marketing campaign which included merchandise tie-ins. For the series' launch event, a set was built (imitating the computerised background) on which the animation was said. Many elements of the original series, including most of the original stories, were included in the animated series.El Chavo Animado also aired in English via Kabillion's on-demand service in the USA. Although it was part of the video-on-demand service, the series did not appear on the Kabillion website until the site's April 2012 relaunch. The series is currently airing on bitMe and Distrito Comedia as of 2020 and it should be noted as of 2022, It now airs on Galavisión along with El Chapulín Colorado Animado. 
## 9 El Chavo Kart (Chaves Kart in Brazil) is a 2014 kart racing game created by Colombian developer Efecto Studios and Mexican developer Slang and published by Televisa Home Entertainment for Xbox 360 and PlayStation 3. A conversion of the game was also released on Android, but was later removed. The game features almost all of the characters of El Chavo: The Animated Series (except for Jaimito and Gloria), with tracks loosely based on locations from the animated series. In 2020, an updated version of the game was released for iOS and Android with an updated art style reminiscent of Funko POP! toys. ## 10 El Chavo (English: The Kid); — also known as El Chavo del Ocho (English: The Kid from number Eight) during its earliest episodes —, is a Mexican television sitcom created by Roberto Gómez Bolaños, produced by Televisa. It premiered on April 27, 1972 and ended on June 12, 1992 after 7 seasons and 311 episodes. The series gained enormous popularity in Hispanic America, Brazil, Spain and other countries. The series theme song is a rendition of Ludwig van Beethoven's Turkish March, rearranged by Jean-Jacques Perrey & retitled “The Elephant Never Forgets”.The show follows the adventures and tribulations of the title character—a poor orphan nicknamed "El Chavo" (which means "The Kid"), played by the show's creator, Roberto Gómez Bolaños "Chespirito"—and his friends, which often cause conflict, of a comedic nature, between the other inhabitants of a fictional low-income housing complex, or, as called in Mexico, vecindad. The idea for the show emerged from a sketch created by Gómez Bolaños where an 8-year-old boy competed with a balloon vendor in a park, said sketch aired for the first time on April 27, 1972. The show centered great importance into the development of the characters, which were each assigned a distinctive personality. 
Since the beginning, Gómez Bolaños decided that El Chavo would be directed toward an adult audience, even though the show itself was about adults interpreting kids. The main cast consisted of Gómez Bolaños, Ramón Valdés, Carlos Villagrán, María Antonieta de las Nieves, Florinda Meza, Rubén Aguirre, Angelines Fernández and Édgar Vivar, who interpreted El Chavo, Don Ramón, Quico, Chilindrina, Doña Florinda, Profesor Jirafales, Doña Clotilde and Señor Barriga. Direction and production of the series fell on Gómez Bolaños, Enrique Segoviano and Carmen Ochoa.El Chavo first appeared in 1972 as a sketch in the Chespirito show which was produced by Televisión Independiente de México (TIM). In 1973, following the merger of TIM and Telesistema Mexicano, it was produced by Televisa and became a weekly half-hour series, which ran until 1980. After that year, shorts continued to be shown in Chespirito until 1992. At its peak of popularity during the mid-1970s, it had a Latin American audience of over 350 million viewers per episode. Given the popularity of the show, the cast went on a global tour to countries in which the show already aired and, in a series of presentations, the cast would dance and act in front of the public.The Brazilian Portuguese dub, Chaves, has been inferred by Brazilian TV Network SBT since 1984, was also seen on the Brazilian versions of Cartoon Network and Boomerang, and currently is also seen on Multishow. Since 2 May 2011, it has aired in the United States on the UniMás network. It previously aired on sister network Univision and its predecessor, the Spanish International Network. It spawned an animated series titled El Chavo Animado.El Chavo continues to be popular with syndicated episodes averaging 91 million daily viewers in all of the markets where it is distributed in the Americas. 
Since it ceased production in 1992, it has earned an estimated US$1.7 billion in syndication fees alone for Televisa.El Chavo was also available on Netflix in the United States, but was removed on December 31, 2019. ## 11 "En el Cine" (English: At the Cinema) is the first episode of the seventh season of the Mexican sitcom El Chavo del Ocho, which aired originally on Televisa on January 29, 1979. The episode was written and directed by creator Roberto Gomez Bolaños. In the episode, everyone in the vecindad goes to the movies, but they end up causing a commotion there. It is the first episode without Carlos Villagrán in the cast, as he left the series after the sixth season. ## 12 Florinda Meza García (born 8 February 1949) is a Mexican actress, comedian, television producer, and screenwriter. She is best known as Doña Florinda in El Chavo del Ocho, La Chimoltrufia in Chespirito, and other various roles in El Chapulín Colorado. ## 13 Horacio Gómez Bolaños (28 June 1930 – 21 November 1999) was a Mexican actor and brother of Roberto Gómez Bolaños (Chespirito). On the TV show El Chavo, he played the character Godínez. Although Horacio appeared in many of his brother's productions, he preferred to handle the business aspects.Gómez Bolaños did not consider an acting career when he was young. Instead, he went to university to study business and graduated with a degree in business administration.When Chespirito started production of El Chavo del Ocho and El Chapulín Colorado in Televisa during 1970, he needed an experienced sales team to look over the marketing side of the productions. Chespirito hired his brother, who was to see, among other things, the sales of products related to his shows, such as toys, clothes and other show related items.Chespirito saw something else in his brother, however, and soon, he convinced Horacio to try it out as an actor. As a result, Horacio Gómez Bolaños got the character "Godínez" on El Chavo del Ocho. 
Horacio Gómez Bolaños appeared less frequently than his co-stars on both of Chespirito's shows. Nevertheless, he also attained wide fame internationally when the show became a favorite among millions of Latin American children, as well as in Spain, the United States and other countries.After the production of both El Chavo del Ocho and El Chapulín Colorado were finished, Horacio Gómez Bolaños retired from acting, focusing instead on directing, producing and overseeing the marketing aspects of other Televisa productions. ## 14 El Chavo del Ocho, often shortened to El Chavo, is a Mexican television sitcom created by Roberto Gómez Bolaños. The show was based on a series of sketches performed on Gómez's eponymous sketch show, Chespirito, which were first performed in 1971. El Chavo became its own series in 1973 and aired until 1980, becoming one of the most popular television programs in the world. Following its cancellation and the relaunch of Chespirito, the El Chavo sketches returned in 1980 and continued to be performed on Chespirito until 1992 when Gómez, by this point in his sixties, discontinued them due to his advancing age.The show follows the life and tribulations of the title character, a poor orphan child (played by Gómez Bolaños) that lives on a Mexican housing complex, typically called a vecindad. He is accompanied by a cast of neighbors, children, and other characters. All the characters, including the children, were played by adults on El Chavo. ## 15 María Antonieta Gómez Rodríguez (born 22 December 1949), more commonly known by her stage name María Antonieta de las Nieves, is a Mexican actress, comedian, singer, and author. Her best remembered role is that of La Chilindrina, one of the main characters of the Televisa sitcom El Chavo del Ocho. ## 16 Ramón Antonio Esteban Gómez Valdés y Castillo (2 September 1923 – 9 August 1988) was a Mexican actor and comedian. He is best remembered for his portrayal of Don Ramón. 
He is also recognized as one of Mexico's best comedians.Born in Mexico City, he was raised in a humble and large family that moved to Ciudad Juárez when he was aged two. Valdés made his acting debut at cinema in the movie Tender Pumpkins (1949), appearing along with his brother, Germán Valdés, already an actor better known as "Tin-Tan", and who introduced Ramón into the acting world. Under extra or supporting roles, he continued making appearances in films during the Golden Age of Mexican cinema. Ramón and Germán had two other brothers, also actors, Manuel Valdés, better known as "Manuel "El Loco" Valdés", and Antonio Valdés, better known as "El Ratón Valdés".In 1968, Valdés met Roberto Gómez Bolaños, better known as "Chespirito", with whom he began working on programs such as Los supergenios de la mesa cuadrada, Chespirito and El Chapulín Colorado. It was on Bolaños's sitcom El Chavo del Ocho that he gained international fame for his portrayal of Don Ramón. He left El Chavo del Ocho in 1979 but returned in 1981 for his final year on the project.In 1982, Valdés starred with Carlos Villagrán on the Venezuelan sitcom Federrico and on Ah que Kiko in 1987. ## 17 Raúl "Chato" Padilla Mendoza (17 June 1918 – 3 February 1994) was a Mexican actor, and a member of Chespirito's comedy troupe, famous for his character in El Chavo del Ocho, Jaimito, el Cartero ("Jaimito, the Mailman"). ## 18 Rubén Aguirre Fuentes (Spanish pronunciation: [ruˈβen aˈɣire ˈfwentes]; 15 June 1934 – 17 June 2016) was a Mexican actor and comedian. He was best known for his character Profesor Jirafales in Televisa's 1970s television show El Chavo del Ocho. Aguirre also participated in another well known television show of the era, El Chapulín Colorado, albeit less frequently. ## 19 The Turkish March (Marcia alla turca) is a classical march theme by Ludwig van Beethoven. It was written for the 1809 Six variations, Op. 76, and in the Turkish style. 
Later in 1811, Beethoven included the Turkish March in a play by August von Kotzebue called The Ruins of Athens (Op. 113), which premiered in Budapest, Hungary in 1812.The march is in B-flat major, tempo vivace and 24 time. Its dynamic scheme is highly suggestive of a procession passing by, starting out pianissimo, poco a poco rising to a fortissimo climax and then receding back to pianissimo by the coda.
## 20 Édgar Ángel Vivar Villanueva (born 28 December 1944) is a Mexican actor, and comedian. He is remembered as "Señor Barriga" and his son "Ñoño" from El Chavo del Ocho, and as "El Botija" from Los Caquitos and Chespirito. His other notable role is in a Mexico telenovela, Amarte así as Don Pedro, a lonely man who work as a cashier in the restaurant of his stepson El Frijol.
```

---

How often has a particular page been viewed on Wikipedia?

```r
x <- "El Chavo del Ocho"
ChavoWikiTrends <- trend_wiki(x)
head(ChavoWikiTrends)
```

```
##               title       date views
## 1 El Chavo del Ocho 2022-08-16   556
## 2 El Chavo del Ocho 2022-08-17   600
## 3 El Chavo del Ocho 2022-08-18   596
## 4 El Chavo del Ocho 2022-08-19   582
## 5 El Chavo del Ocho 2022-08-20   582
## 6 El Chavo del Ocho 2022-08-21   587
```

---

```r
plot(ChavoWikiTrends$date, ChavoWikiTrends$views,
     type = "l", col = "blue",
     main = "Views in English Wikipedia",
     xlab = "date", ylab = "views",
     sub = "Wikipedia Article: \"El Chavo del Ocho\"")
```

<!-- -->

---

```r
presidentialElection_BR1 <- trend_wiki("Lula")
presidentialElection_BR2 <- trend_wiki("Jair Bolsonaro")
head(presidentialElection_BR1, 4)
```

```
##                       title       date views
## 1 Luiz Inácio Lula da Silva 2022-08-16  5172
## 2 Luiz Inácio Lula da Silva 2022-08-17  3014
## 3 Luiz Inácio Lula da Silva 2022-08-18  1942
## 4 Luiz Inácio Lula da Silva 2022-08-19  2149
```

```r
head(presidentialElection_BR2, 4)
```

```
##            title       date views
## 1 Jair Bolsonaro 2022-08-16  3552
## 2 Jair Bolsonaro 2022-08-17  3425
## 3 Jair Bolsonaro 2022-08-18  2798
## 4 Jair Bolsonaro 2022-08-19  3329
```

---

```r
GG_PE <- rbind(presidentialElection_BR1, presidentialElection_BR2) # binding the two data frames
sample_n(GG_PE, 6) # instead of head(), take a random sample of rows
```

```
##                       title       date views
## 1 Luiz Inácio Lula da Silva 2022-09-01  1722
## 2            Jair Bolsonaro 2022-09-05  9419
## 3            Jair Bolsonaro 2022-09-02  3848
## 4            Jair Bolsonaro 2022-09-16  3230
## 5            Jair Bolsonaro 2022-10-12  4231
## 6            Jair Bolsonaro 2022-10-01 12036
```

```r
GG_PE.BR <- GG_PE |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = views, colour = title)) +
  theme(legend.position = "top") +
  labs(title = "Views in English Wikipedia",
       subtitle = "Views of the Brazilian presidential candidates in English Wikipedia",
       caption = "considered year: 2022\nElaboration: @alissonmasoares") +
  scale_color_manual(values = c("blue", "red"), name = "Presidential candidate") +
  # vertical line and label marking election day
  geom_vline(xintercept = as.numeric(as.Date("2022-10-02")), linetype = "solid", color = "darkgreen") +
  annotate(geom = "label", x = as.Date("2022-10-02"), y = 4925, label = "election day", colour = "darkgreen")
```

---

```r
GG_PE.BR
```

<!-- -->

---

Downside of this package: it only captures articles from the English Wikipedia!

--

Wikipedia packages in R:

- [WikipediR](https://cran.r-project.org/web/packages/WikipediR/) "A wrapper for the MediaWiki API, aimed particularly at the Wikimedia 'production' wikis, such as Wikipedia. It can be used to retrieve page text, information about users or the history of pages, and elements of the category tree."
- [WikipediaR](https://cran.r-project.org/web/packages/WikipediaR/) "functions provide details for a specific Wikipedia page : all links that are present, all pages that link to, all the contributions (revisions for main pages, and discussions for talk pages). Two functions provide details for a specific user : all contributions, and general information (as name, gender, rights or groups). It provides additional information compared to others packages, as WikipediR. It does not need login."
- [wikipediatrend](https://cran.r-project.org/web/packages/wikipediatrend/index.html) "Public Subject Attention via Wikipedia Page View Statistics"
- [getwiki](https://cran.r-project.org/web/packages/getwiki/index.html): "retrieving text data in a tidy format that can be used for Natural Language Processing"

---

## Google Ngrams

```r
library(ngramr)
library(dplyr)
```

--

Houston, we have a problem!

```r
ngram(c("mouse", "rat"), year_start = 1950)
```

```
## Please check Google's Ngram Viewer site is up.
```

```
## Timeout was reached: [books.google.com] Operation timed out after 264 milliseconds with 0 bytes received
```

```
## NULL
```

---

As a workaround, we can use the `hacker` sample dataset that ships with the ngramr package:

```r
str(ngramr::hacker)
```

```
## ngram [236 × 4] (S3: ngram/tbl_df/tbl/data.frame)
##  $ Year     : num [1:236] 1950 1951 1952 1953 1954 ...
##  $ Phrase   : Factor w/ 2 levels "hacker","programmer": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Frequency: num [1:236] 9.49e-09 1.17e-08 1.08e-08 1.01e-08 9.68e-09 ...
##  $ Corpus   : Factor w/ 2 levels "eng_2012","eng_us_2012": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "smoothing")= num 3
##  - attr(*, "case_sensitive")= logi TRUE
```

---

```r
filter(hacker, Corpus == "eng_2012") |>
  ggplot() +
  aes(x = Year) +
  geom_line(aes(y = Frequency, colour = Phrase))
```

---

"ggram downloads data from the Google Ngram Viewer website and plots it in ggplot2 style."

```r
ggram(c("monarchy", "democracy"),
      year_start = 1500, year_end = 2000,
      # corpus = "en-2019",
      # corpus = "eng_gb_2012",
      ignore_case = TRUE,
      geom = "area",
      geom_options = list(position = "stack")) +
  labs(y = NULL)
```



---

**Choosing your corpora:** to see the available corpora, such as “eng_us_2019”, “eng_gb_2019”, “chi_sim_2019”, “fre_2019”, “ger_2019”, “heb_2019”, “ger_2012”, “spa_2012”, “rus_2012”, “ita_2012”, query the `corpuses` data frame that comes with ngramr:

```r
termo = "span"
ngramr::corpuses |> filter(grepl(termo, Language, ignore.case = T))
```

```
##         Shorthand Shorthand.Old Language Informal.corpus.name
## es-2019   es-2019      spa_2019  Spanish         Spanish 2019
## es-2012   es-2012      spa_2012  Spanish         Spanish 2012
## es-2009   es-2009      spa_2009  Spanish         Spanish 2009
##                Persistent.identifier
## es-2019     googlebooks-spa-20200217
## es-2012 googlebooks-spa-all-20120701
## es-2009 googlebooks-spa-all-20090715
##                                          Description Last.Year
## es-2019 Books predominantly in the Spanish language.      2019
## es-2012 Books predominantly in the Spanish language.      2009
## es-2009 Books predominantly in the Spanish language.      2008
```

or see [this link](https://books.google.com/ngrams/info#)

---

## Gutenbergr

Installing the package [gutenbergr](https://github.com/cran/gutenbergr)

```r
install.packages("gutenbergr") # Archived on 2022-10-03
```

--

```r
devtools::install_github("ropensci/gutenbergr")
```

--

Loading the package

```r
library(dplyr)
library(gutenbergr)
```

---

Searching for books/authors

```r
str(gutenberg_works())
```

```
## tibble [40,737 × 8] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id       : int [1:40737] 0 1 2 3 4 5 6 7 8 9 ...
##  $ title              : chr [1:40737] NA "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" ...
##  $ author             : chr [1:40737] NA "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" ...
##  $ gutenberg_author_id: int [1:40737] NA 1638 1 1666 3 1 4 NA 3 3 ...
##  $ language           : chr [1:40737] "en" "en" "en" "en" ...
##  $ gutenberg_bookshelf: chr [1:40737] NA "United States Law/American Revolutionary War/Politics" "American Revolutionary War/Politics/United States Law" NA ...
##  $ rights             : chr [1:40737] "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
##  $ has_text           : logi [1:40737] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  - attr(*, "date_updated")= Date[1:1], format: "2016-05-05"
```

```r
gutenberg_works() |> filter(title == "Don Quixote")
```

```
## # A tibble: 1 × 8
##   gutenberg_id title       author         guten…¹ langu…² guten…³ rights has_t…⁴
##          <int> <chr>       <chr>            <int> <chr>   <chr>   <chr>  <lgl>
## 1          996 Don Quixote Cervantes Saa…     505 en      Harvar… Publi… TRUE
## # … with abbreviated variable names ¹gutenberg_author_id, ²language,
## #   ³gutenberg_bookshelf, ⁴has_text
```

```r
gutenberg_works() |> filter(grepl("Cervantes", author, ignore.case = T))
```

```
## # A tibble: 48 × 8
##    gutenberg_id title              author guten…¹ langu…² guten…³ rights has_t…⁴
##           <int> <chr>              <chr>    <int> <chr>   <chr>   <chr>  <lgl>
##  1          996 Don Quixote        Cerva…     505 en      Harvar… Publi… TRUE
##  2         5903 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  3         5904 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  4         5905 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  5         5906 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  6         5907 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  7         5908 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  8         5909 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
##  9         5910 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
## 10         5911 The History of Do… Cerva…     505 en      <NA>    Publi… TRUE
## # … with 38 more rows, and abbreviated variable names ¹gutenberg_author_id,
## #   ²language, ³gutenberg_bookshelf, ⁴has_text
```

```r
filter(ngramr::corpuses, grepl(termo, Language, ignore.case = T))
```

```
##         Shorthand Shorthand.Old Language Informal.corpus.name
## es-2019   es-2019      spa_2019  Spanish         Spanish 2019
## es-2012   es-2012      spa_2012  Spanish         Spanish 2012
## es-2009   es-2009      spa_2009  Spanish         Spanish 2009
##                Persistent.identifier
## es-2019     googlebooks-spa-20200217
## es-2012 googlebooks-spa-all-20120701
## es-2009 googlebooks-spa-all-20090715
##                                          Description Last.Year
## es-2019 Books predominantly in the Spanish language.
     2019
## es-2012 Books predominantly in the Spanish language.      2009
## es-2009 Books predominantly in the Spanish language.      2008
```

???

Downside: by default it lists only books in English!

[example of use of gutenbergr](https://bookdown.org/Maxine/tidy-text-mining/the-gutenbergr-package.html) in a text mining tutorial

---

## APIs

More tips

- Online book [APIs for social scientists](https://bookdown.org/paul/apis_for_social_scientists/) with tutorials: "A collaborative review by Paul C. Bauer, Camille Landesvatter, many others. The present online book provide a review of APIs that may be useful for social scientists. Covers a wide selection of APIs from google, Instagram, Youtube and others. R code included."
- The [rOpenSci site](https://ropensci.org/) also has a list of R packages that work with APIs

---

class: chapter-slide

# 2. Doing (some) text analysis in R without coding

---

## Text Analysis - GUI

Online

- [voyant tools](https://voyant-tools.org/) (website)
- [YouTube tutorial](https://www.youtube.com/watch?v=ZxIbTzbFuYs&list=PL6q06atDhFmAV2uCLgH2sK1kWXWuYyucp) (in Portuguese)

Using R

- [R Text Mining Solution](https://github.com/nalimilan/R.TeMiS):
  - [R-commander (Rcmdr)](https://cran.r-project.org/web/packages/Rcmdr/index.html) + [RcmdrPlugin.temis](https://cran.r-project.org/web/packages/RcmdrPlugin.temis/index.html) (last updated in 2018).
- [RKWard](https://rkward.kde.org/) + [koRpus package](https://cran.r-project.org/web/packages/koRpus/index.html)
- [IRaMuTeQ](http://iramuteq.org/) (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires)

---

class: chapter-slide

# 3. Doing Text Analysis (on steroids) in R by writing code

---

### 3.1 - Data cleaning (or data wrangling)

<!-- Once you have the data, now we have to do some preparations -->

---

## Regular Expressions

Regular expressions (regex or regexp) are character sequences that describe patterns, used to search and substitute text. A pattern specifies:

- What to match: digits, letters, or punctuation (of any kind)
  - `\d`, `\w`, `[awdf]`, `[A-Za-z]`
  - or anything that is NOT one of these: `\D`, `\W`
- How many (quantifiers): an exact count such as `{1}`, `{2}`, `{3}`; zero or one `?`; one or more `+`; zero or more `*`.
- Where: at the beginning of a line (`^`), at the end (`$`), or after/before a specific character.

---

```r
texto = "ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details."
gsub("[gG]raph", "GRAPH", texto)
```

```
## [1] "ggplot2 is a system for declaratively creating GRAPHics, based on The Grammar of GRAPHics. You provide the data, tell ggplot2 how to map variables to aesthetics, what GRAPHical primitives to use, and it takes care of the details."
```

--

With regex you can change the order of the text elements:

```r
nome = "Joaquim José Xavier"
gsub("^(.*) (\\w+)$", "\\2, \\1", nome)
```

```
## [1] "Xavier, Joaquim José"
```

---

### More about regex

- [Cheatsheet of regex in R](https://github.com/rstudio/cheatsheets/raw/master/regex.pdf) (in English). RECOMMENDED
- ["Regular Expressions as used in R"](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html). Regular expressions in R; official documentation (in English).
- Albert Y. Kim. [Regular Expressions in R](https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html)
- [Regular Expressions with The R Language](https://www.regular-expressions.info/rlanguage.html). A site dedicated to regex in several programming languages.
- PENG, Roger.
[R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html). 2020. Bookdown online.

<!-- - The [regexplain](https://www.garrickadenbuie.com/project/regexplain/) package helps to test regexes and regex-based commands in R interactively and easily, through a shiny-based interface. -->

---

## Tokenization

Splitting a text into vectors with `strsplit()`

```r
strsplit(texto, "[,\\.]") # split at every comma and period (sentences and clauses)
```

```
## [[1]]
## [1] "ggplot2 is a system for declaratively creating graphics"
## [2] " based on The Grammar of Graphics"
## [3] " You provide the data"
## [4] " tell ggplot2 how to map variables to aesthetics"
## [5] " what graphical primitives to use"
## [6] " and it takes care of the details"
```

```r
strsplit(texto, " ") # split into words
```

```
## [[1]]
##  [1] "ggplot2"       "is"            "a"             "system"
##  [5] "for"           "declaratively" "creating"      "graphics,"
##  [9] "based"         "on"            "The"           "Grammar"
## [13] "of"            "Graphics."     "You"           "provide"
## [17] "the"           "data,"         "tell"          "ggplot2"
## [21] "how"           "to"            "map"           "variables"
## [25] "to"            "aesthetics,"   "what"          "graphical"
## [29] "primitives"    "to"            "use,"          "and"
## [33] "it"            "takes"         "care"          "of"
## [37] "the"           "details."
```

```r
# the trailing space in the pattern also removes the blank after each comma or period
t <- strsplit(texto, "[,\\.] ") |> unlist()
t
```

```
## [1] "ggplot2 is a system for declaratively creating graphics"
## [2] "based on The Grammar of Graphics"
## [3] "You provide the data"
## [4] "tell ggplot2 how to map variables to aesthetics"
## [5] "what graphical primitives to use"
## [6] "and it takes care of the details."
```

---

To do the opposite and join the elements of a vector into a single string, use `paste(..., collapse = " ")`.
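The splitting and joining steps described above can be sketched together on a toy sentence. This is only a minimal illustration: the sentence `frase` and the variable names are made up for this example, and the pattern `\\W+` (one or more non-word characters) is just one possible choice of separator.

```r
# Hypothetical example sentence, not taken from the slides
frase <- "Text analysis, in R, is fun."

# Split on runs of non-word characters, so spaces and punctuation are dropped together
tokens <- unlist(strsplit(frase, "\\W+"))
tokens

# Join the tokens back into a single string, separated by one space
joined <- paste(tokens, collapse = " ")
joined
```

Splitting on `\\W+` rather than on a single space avoids tokens such as `"graphics,"` that keep their punctuation. The punctuation is discarded, so joining the tokens does not restore the original sentence exactly. The same `paste()` call works on any character vector, including the vector `t` built on the previous slide.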
Here we join only the first two elements of `t`:

```r
t[1:2]
```

```
## [1] "ggplot2 is a system for declaratively creating graphics"
## [2] "based on The Grammar of Graphics"
```

```r
paste(t[1:2], collapse = " ")
```

```
## [1] "ggplot2 is a system for declaratively creating graphics based on The Grammar of Graphics"
```

---

## stringr package



stringr makes working with strings and regex easier and more intuitive.

```r
library(stringr)
```

---

```r
Ramon <- "No hay trabajo malo, lo malo es tener que trabajar"
str_to_upper(Ramon) # everything in upper case
```

```
## [1] "NO HAY TRABAJO MALO, LO MALO ES TENER QUE TRABAJAR"
```

```r
str_to_lower(Ramon) # everything in lower case
```

```
## [1] "no hay trabajo malo, lo malo es tener que trabajar"
```

```r
str_to_title(Ramon) # only the first letter of each word in upper case
```

```
## [1] "No Hay Trabajo Malo, Lo Malo Es Tener Que Trabajar"
```

```r
str_count(Ramon, "trabaj") # counting how many times a pattern appears
```

```
## [1] 2
```

```r
str_replace(Ramon, "malo", "bueno") # replaces only the first match
```

```
## [1] "No hay trabajo bueno, lo malo es tener que trabajar"
```

```r
str_replace_all(Ramon, "malo", "bueno") # replaces every match
```

```
## [1] "No hay trabajo bueno, lo bueno es tener que trabajar"
```

```r
str_remove_all(Ramon, "malo") # remove all occurrences of "malo"
```

```
## [1] "No hay trabajo , lo  es tener que trabajar"
```

```r
# keeps only the non-empty strings; newer stringr versions may reject an
# empty pattern, in which case str_subset(".+") does the same job
c("", "a", "b") |> str_subset("")
```

```
## [1] "a" "b"
```

---

- [stringr cheatsheet](http://edrub.in/CheatSheets/cheatSheetStringr.pdf)
- J. Kyle Armstrong [Fundamentals of Data Wrangling with R](https://bookdown.org/jkylearmstrong/jeff_data_wrangling/) 2021.
bookdown online

---

## Typical steps in text analysis (TA)

In the bag-of-words approach:

- Tokenization: breaking the text into parts (sentences, terms, words, characters)
- Removing stopwords and punctuation
- Stemming or lemmatization
- Substitutions (synonyms, abbreviations)

---

## Tokenization

```r
library(ngram)
```

```
## 
## Attaching package: 'ngram'
```

```
## The following object is masked from 'package:ngramr':
## 
##     ngram
```

```r
library(dplyr)
ngram::ngram(Ramon, 2) |> get.ngrams() # bigrams: sequences of two words
```

```
## [1] "malo, lo"      "lo malo"       "No hay"        "hay trabajo"
## [5] "trabajo malo," "malo es"       "que trabajar"  "tener que"
## [9] "es tener"
```

```r
RamonNgram <- ngram::ngram(Ramon, 3) # trigrams
RamonNgram
```

```
## An ngram object with 8 3-grams
```

```r
ngram::get.ngrams(RamonNgram) # printing
```

```
## [1] "es tener que"       "trabajo malo, lo"   "lo malo es"
## [4] "tener que trabajar" "hay trabajo malo,"  "malo, lo malo"
## [7] "malo es tener"      "No hay trabajo"
```

```r
Quijote <- readLines(url("https://www.gutenberg.org/files/2000/2000-0.txt"))
str(Quijote)
```

```
##  chr [1:38062] "The Project Gutenberg eBook of Don Quijote, by Miguel de Cervantes Saavedra" ...
```

```r
QuijoteTokens <- Quijote |> strsplit("\\W") |> unlist() # split on every non-word character
QuijoteTokens |> plyr::count() |> arrange(-freq) |> head()
```

```
##     x  freq
## 1     61447
## 2 que 20668
## 3  de 18147
## 4   y 17249
## 5  la 10331
## 6   a  9641
```

```r
# a stopword vector can be typed element by element ...
stopwords <- c("que", "de", "y")
# ... or, with less typing, built by splitting a single string
stopwords <- "que de y la las a o los de del el en" |> strsplit(" ") |> unlist()
stopwords
```

```
##  [1] "que" "de"  "y"   "la"  "las" "a"   "o"   "los" "de"  "del" "el"  "en"
```

```r
QuijoteTokens2 <- QuijoteTokens[!QuijoteTokens %in% stopwords] |>
  plyr::count() |>
  arrange(-freq) |>
  head(10)
QuijoteTokens2
```

```
##      x  freq
## 1      61447
## 2   no  5806
## 3   se  4752
## 4  con  4126
## 5  por  3784
## 6   lo  3424
## 7   le  3418
## 8   su  3356
## 9  don  2602
## 10  me  2345
```

--

```r
swEs <- stopwords::stopwords(language = "es")
str(swEs)
```

```
##  chr [1:308] "de" "la" "que" "el" "en" "y" "a" "los" "del" "se" "las" "por" ...
```

Adding more words to our stopwords:

```r
SW.Es <- c(swEs, "con", "Y", "")
```

```r
# note: stopwords are removed before lowercasing, so capitalized forms
# (such as "Y") have to be listed explicitly in SW.Es
QuijoteTokens2 <- QuijoteTokens[!QuijoteTokens %in% SW.Es] |>
  str_to_lower() |>
  plyr::count() |>
  arrange(-freq)
QuijoteTokens2 |> head(10)
```

```
##            x freq
## 1        don 2718
## 2    quijote 2245
## 3     sancho 2174
## 4         si 1968
## 5       dijo 1808
## 6        tan 1245
## 7        así 1065
## 8      señor 1065
## 9  respondió 1063
## 10       ser 1059
```

---

## Stemming

```r
library(SnowballC)
SnowballC::getStemLanguages() # available languages
```

```
##  [1] "arabic"     "basque"     "catalan"    "danish"     "dutch"
##  [6] "english"    "finnish"    "french"     "german"     "greek"
## [11] "hindi"      "hungarian"  "indonesian" "irish"      "italian"
## [16] "lithuanian" "nepali"     "norwegian"  "porter"     "portuguese"
## [21] "romanian"   "russian"    "spanish"    "swedish"    "tamil"
## [26] "turkish"
```

```r
RamonT <- Ramon |> strsplit("[ ,]") |> unlist() |> str_subset("")
SnowballC::wordStem(RamonT, language = "spanish")
```

```
##  [1] "No"     "hay"    "trabaj" "mal"    "lo"     "mal"    "es"     "ten"
##  [9] "que"    "trabaj"
```

---

## Lemmatization

- [textstem](https://github.com/trinker/textstem/)::lemmatize_words()
- koRpus::treetag
- SnowballC::wordStem
- nlp_lemmatizer
- [Udpipe](https://cran.r-project.org/web/packages/udpipe/) (support for various languages)

--

```r
textstem::lemmatize_strings("walking walked walk")
```

```
## [1] "walk walk walk"
```

---

## Substitutions: abbreviations

```r
D.subs <- read.table(header = TRUE, sep = ":", text = 'abr:subs
Art.:artigo
Cel.:coronel
Dr.:Doutor
Dra.:doutora
Drs.:doutores') |>
  tibble::deframe() # convert df to named vector
# the names are used as regex patterns, so "." matches any character
QuijoteTokens2 |> mutate(x = stringr::str_replace_all(x, D.subs))
```

---

## Wordcloud

Taking a description of [Don Ramón](https://elchavo.fandom.com/es/wiki/Don_Ram%C3%B3n)

```r
Ramon2 <- 'Don Ramón es un hombre viudo de 50 años de edad, que perdió a su mujer durante el parto de su hija, la Chilindrina. Vive en la vecindad junto a ella, en el departamento Nº 72, aunque en los primeros episodios vivía en el departamento de Doña Florinda (el Nº 14). A lo largo de su vida tuvo diferentes oficios (o dice haberlos tenido), como por ejemplo, boxeador, jugador de fútbol americano, torero, guitarrista, cantante, maestro de obras, etc. Recurrentemente ejerce oficios cotidianos, tales como plomero, zapatero, carpintero, yesero, vendedor de globos, mecánico, vendedor de churros, peluquero, ropavejero, lechero, entre muchos. Es un hombre carismático y de buen corazón, pero con un carácter explosivo. Tiene un comportamiento muy impaciente con los niños mal portados; entre otras cosas, le molesta que El Chavo se burle de él por ser delgado y le ponga apodos como "patas de chichicuilote". Al Chavo, Quico y la Chilindrina, los reprende físicamente cuando estos hacen travesuras, lo que causa que siempre sea acusado injustamente por Doña Florinda de intentar o haberle hecho algo malo a Quico, cuando en realidad fue culpa de otra persona (la mayoría del tiempo del Chavo).
En estos casos, es reprendido violentamente con una cachetada o hasta con una golpiza fuera de escena, la cual lo deja muy herido sin que Doña Florinda le permita explicarle lo sucedido. Hubo una ocasión en la que Doña Florinda sí le permitió explicarle, y otra en la que Quico le aclaró que el causante de su tristeza fue El Chavo y no Don Ramón, aunque en ambas situaciones, Doña Florinda terminó dándole la cachetada a Don Ramón. Después de cachetearlo, Doña Florinda le dice a Quico que no se junte con esa chusma y luego Quico lo avienta y le dice chusma chusma pfff y en la mayoría de ocasiones Doña Florinda termina diciéndole "Y la próxima vez, vaya a... a su abuela" (en alusión al personaje de Doña Nieves). Sin embargo, a pesar de la enemistad que los caracteriza, hay ocasiones en las que olvidan sus diferencias (en su mayoría momentos especiales), como en un capítulo de Navidad, en el que decide no abofetearlo por ser una fecha especial, y otras en las que Doña Florinda lo felicita por sus actos humanos hacia otros personajes, como cuando él vendía los churros que ella preparaba y se echó la culpa de habérselos comido todos para proteger al Chavo, quien era el verdadero culpable.'
```

```r
wordcloud::wordcloud(Ramon2)
```

```
## Loading required namespace: tm
```

---

```r
wordcloud::wordcloud(Ramon2, scale = c(2, .8),
                     colors = c("royalblue", "blue", "darkblue"))
```

---

Or using our Quijote data frame

```r
str(QuijoteTokens2)
```

```
## 'data.frame':	23521 obs. of  2 variables:
##  $ x   : chr  "don" "quijote" "sancho" "si" ...
##  $ freq: int  2718 2245 2174 1968 1808 1245 1065 1065 1063 1059 ...
```

```r
n = 40 # only the most frequent words
wordcloud::wordcloud(QuijoteTokens2$x[1:n],
                     freq = QuijoteTokens2$freq[1:n],
                     colors = c("royalblue", "blue", "darkblue"))
```

---

## ggwordcloud

[ggwordcloud](https://lepennec.github.io/ggwordcloud/articles/ggwordcloud.html): word clouds using the grammar of graphics

```r
library(ggwordcloud)
head(QuijoteTokens2, 50) |>
  ggplot() +
  aes(label = x, size = freq, color = freq) +
  ggwordcloud::geom_text_wordcloud_area(shape = "circle") + # other shapes: diamond, square
  scale_size_area(max_size = 30) +
  theme_minimal() +
  scale_color_gradient(low = "lightblue", high = "darkred") +
  labs(title = "Wordcloud with ggwordcloud",
       subtitle = "Words from the Book Don Quijote",
       caption = "LatinR")
```

---

Other wordcloud packages

- [wordcloud2](https://cran.r-project.org/web/packages/wordcloud2/)
- With [quanteda](https://quanteda.io/reference/textplot_wordcloud.html) it is possible to make comparison word clouds

---

class: chapter-slide

## Sentiment Analysis

---

## Sentiment Analysis

<!-- # 3.3 - Dictionary methods: concept frequency and basic sentiment analysis -->

```
## 
## Attaching package: 'SentimentAnalysis'
```

```
## The following object is masked from 'package:base':
## 
##     write
```

```r
library(SentimentAnalysis) # repeated here so that the chunk runs on its own
documentos <- c("Café é bom", "Rúcula é ruim", "Bem mais ou menos",
                "refrigerante é péssimo", "ruim é ter de trabalhar")
# creating our dictionary
dict_pt <- SentimentDictionaryBinary(
  # vector of words with a positive connotation
  c("bom", "boa", "excelente"),
  # vector of words with a negative connotation
  c("ruim", "péssimo"))
AnaliseSentimentos <- analyzeSentiment(documentos,
                                       language = "portuguese",
                                       rules = list("pontos" = list(ruleSentiment, dict_pt)))
```

---

```r
AnaliseSentimentos
```

```
##       pontos
## 1  0.5000000
## 2 -0.5000000
## 3  0.0000000
## 4 -0.5000000
## 5 -0.3333333
```

```r
# Presenting this result in a more human-readable form, as a data frame.
sentimentosDF <- data.frame(frase = documentos,
                            score.sentimentos = AnaliseSentimentos,
                            # converting the scores into directions: positive, negative and neutral
                            sentimento = convertToDirection(AnaliseSentimentos$pontos))
sentimentosDF
```

```
##                     frase     pontos sentimento
## 1              Café é bom  0.5000000   positive
## 2           Rúcula é ruim -0.5000000   negative
## 3       Bem mais ou menos  0.0000000    neutral
## 4  refrigerante é péssimo -0.5000000   negative
## 5 ruim é ter de trabalhar -0.3333333   negative
```

---

## Weighted dictionary

```r
# each word gets its own score instead of a plain positive/negative label
d <- SentimentDictionaryWeighted(c("péssimo", "ruim", "bom", "boa", "excelente", "top"),
                                 c(-10, -5, +5, +5, +10, +10))
d
```

```
## Type: weighted (words with individual scores)
## Intercept: 0
## -10.00 péssimo
## -5.00 ruim
## 5.00 bom
## 5.00 boa
## 10.00 excelente
## 10.00 top
```

---

class: chapter-slide

# Visualization

---

```r
library(ggpage)
library(dplyr)
library(ggplot2)
head(tinderbox, 4)
```

```
## # A tibble: 4 × 2
##   text                                                                     book
##   <chr>                                                                    <chr>
## 1 "A soldier came marching along the high road: \"Left, right - left, rig… The …
## 2 "had his knapsack on his back, and a sword at his side; he had been to … The …
## 3 "and was now returning home. As he walked on, he met a very frightful-l… The …
## 4 "witch in the road. Her under-lip hung quite down on her breast, and sh… The …
```

```r
str(tinderbox)
```

```
## tibble [211 × 2] (S3: tbl_df/tbl/data.frame)
##  $ text: chr [1:211] "A soldier came marching along the high road: \"Left, right - left, right.\" He" "had his knapsack on his back, and a sword at his side; he had been to the wars," "and was now returning home. As he walked on, he met a very frightful-looking old" "witch in the road. Her under-lip hung quite down on her breast, and she stopped" ...
##  $ book: chr [1:211] "The tinder-box" "The tinder-box" "The tinder-box" "The tinder-box" ...
```

---

The function `ggpage_quick()` creates a quick visualization of the page:

```r
ggpage_quick(tinderbox)
```

---

- The function `ggpage_build()` converts the text into a tibble (with a `word` column, used below), which gives us more options because it can be handled like any other data frame

```r
tinderbox |> # tibble with the lines of text
  ggpage_build() |> # into a tibble ready for plotting
  mutate(word = stringr::str_extract(word, "soldier")) |> # keep "soldier", NA for the other words
  ggpage_plot(aes(fill = word)) # plot
```

---

```r
tinderbox |> # tibble with the lines of text
  ggpage_build() |> # into a tibble ready for plotting
  mutate(long_word = stringr::str_length(word) > 8) |>
  ggpage_plot(aes(fill = long_word)) + # creates a visualization from the ggpage_build() result
  labs(title = "Long words throughout the text of The Tinder-box") +
  scale_fill_manual(values = c("grey70", "blue"), # colors
                    labels = c("8 or fewer", "9 or more"),
                    name = "Word length")
```

---

## Semantic Parsing

Check the available language models [here](https://ufal.mff.cuni.cz/udpipe/1/models)

```r
library(udpipe)
dl <- udpipe_download_model(language = "portuguese-bosque")
dl <- udpipe_download_model(language = "spanish-ancora")
```

```r
dl <- udpipe_load_model("~/Documentos/Tablet-PC-Sync/eventos/LatinR/Rmd/spanish-ancora-ud-2.5-191206.udpipe")
str(dl)
```

```
## List of 2
##  $ file : chr "/home/alisson/Documentos/Tablet-PC-Sync/eventos/LatinR/Rmd/spanish-ancora-ud-2.5-191206.udpipe"
##  $ model:<externalptr>
##  - attr(*, "class")= chr "udpipe_model"
```

---

```r
anotado <- udpipe::udpipe_annotate(dl, x = Quijote[50:100]) |> as.data.frame()
head(anotado, 10)
```

```
##    doc_id paragraph_id sentence_id                             sentence
## 1    doc3            1           1                    Al Duque de Béjar
## 2    doc3            1           1                    Al Duque de Béjar
## 3    doc3            1           1                    Al Duque de Béjar
## 4    doc3            1           1                    Al Duque de Béjar
## 5    doc6            1           1                              Prólogo
## 6    doc9            1           1 Al libro de don Quijote de la Mancha
## 7    doc9            1           1 Al libro de don Quijote de la Mancha
## 8    doc9            1           1 Al libro de don Quijote de la Mancha
## 9    doc9            1           1 Al libro de don Quijote de la Mancha
## 10   doc9            1           1 Al libro de don Quijote de la Mancha
##    token_id   token   lemma  upos  xpos
## 1         1      Al      al   ADP   ADP
## 2         2   Duque   Duque PROPN PROPN
## 3         3      de      de   ADP   ADP
## 4         4   Béjar   Béjar PROPN PROPN
## 5         1 Prólogo próloer  VERB  VERB
## 6         1      Al      al   ADP   ADP
## 7         2   libro   libro  NOUN  NOUN
## 8         3      de      de   ADP   ADP
## 9         4     don     don  NOUN  NOUN
## 10        5 Quijote Quijote PROPN PROPN
##                                                    feats head_token_id dep_rel
## 1                                       AdpType=Preppron             2    case
## 2                                                   <NA>             0    root
## 3                                           AdpType=Prep             4    case
## 4                                                   <NA>             2    flat
## 5  Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin             0    root
## 6                                       AdpType=Preppron             2    case
## 7                                Gender=Masc|Number=Sing             0    root
## 8                                           AdpType=Prep             4    case
## 9                                Gender=Masc|Number=Sing             2    nmod
## 10                                                  <NA>             4    flat
##    deps            misc
## 1  <NA>            <NA>
## 2  <NA>            <NA>
## 3  <NA>            <NA>
## 4  <NA> SpacesAfter=\\n
## 5  <NA> SpacesAfter=\\n
## 6  <NA>            <NA>
## 7  <NA>            <NA>
## 8  <NA>            <NA>
## 9  <NA>            <NA>
## 10 <NA>            <NA>
```

```r
plyr::count(anotado, "upos") |> arrange(-freq) # frequency of each part-of-speech tag
```

```
##     upos freq
## 1   NOUN   50
## 2    ADP   45
## 3    DET   33
## 4   PRON   28
## 5   VERB   22
## 6    ADJ   18
## 7  PROPN   11
## 8  CCONJ    7
## 9    ADV    4
## 10 PUNCT    3
## 11 SCONJ    3
## 12  <NA>    1
```

---

```r
anotado |>
  select(upos, lemma) |>
  filter(upos == "NOUN") |> # other options: "ADJ", "VERB"
  plyr::count() |>
  arrange(-freq)
```

```
##    upos       lemma freq
## 1  NOUN         don    9
## 2  NOUN   caballero    4
## 3  NOUN      suceso    3
## 4  NOUN     cabrero    2
## 5  NOUN         fin    2
## 6  NOUN     hidalgo    2
## 7  NOUN      salida    2
## 8  NOUN    aventura    1
## 9  NOUN     barbero    1
## 10 NOUN     batalla    1
## 11 NOUN   condición    1
## 12 NOUN      cuento    1
## 13 NOUN        cura    1
## 14 NOUN   desgracia    1
## 15 NOUN   ejercicio    1
## 16 NOUN  escrutinio    1
## 17 NOUN    gallardo    1
## 18 NOUN    librería    1
## 19 NOUN       libro    1
## 20 NOUN      manera    1
## 21 NOUN     molinos    1
## 22 NOUN   narración    1
## 23 NOUN     pastora    1
## 24 NOUN     peligro    1
## 25 NOUN     Quijote    1
## 26 NOUN recordación    1
## 27 NOUN      tierra    1
## 28 NOUN       turba    1
## 29 NOUN    valiente    1
## 30 NOUN       venta    1
## 31 NOUN      viento    1
## 32 NOUN    vizcaíno    1
## 33 NOUN     yangüés    1
```

---

### wordnet

```r
coocorrencias <- cooccurrence(x = subset(anotado, upos %in% c("NOUN", "ADJ", "VERB")),
                              term = "lemma",
                              group =
                                c("doc_id", "paragraph_id", "sentence_id"))
wordnetwork <- head(coocorrencias, 60) # the 60 most frequent co-occurrences
wordnetwork <- igraph::graph_from_data_frame(wordnetwork)
```

```r
plotGgraphWordNet <- ggraph::ggraph(wordnetwork, layout = "fr") +
  ggraph::geom_edge_link(aes(width = cooc/3, alpha = 0.35), edge_colour = "darkred") +
  # ggraph::geom_edge_density(aes(fill = cooc)) +
  ggraph::geom_node_text(aes(label = name), col = "black", size = 3.5) +
  ggraph::theme_graph(base_family = "Garamond")
plotGgraphWordNet
```

---

class: chapter-slide

# Further steps

---

## Further steps

<!--  -->



- SILGE, Julia. [Text Mining with R](https://www.tidytextmining.com/)



- JURAFSKY, D.; MARTIN, J. Speech and language processing: An introduction to speech recognition, computational linguistics and natural language processing. Upper Saddle River, NJ: Prentice Hall, 2020. [PDFs of the individual chapters](https://web.stanford.edu/~jurafsky/slp3/), [full book](https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf). (A fairly extensive and more theoretical handbook on NLP)